This report demonstrates a detailed Exploratory Data Analysis (EDA) for the final project data. The step 1 variables are analyzed first, followed by the step 2 variables. You were not required or expected to go to this level of depth in the EDA portion of the final project. This report was created to provide an example of how I would study a data set before training models. It discusses how the visualizations provide insight and context that can ultimately help interpret model behavior.
This report uses the tidyverse suite, which is loaded below. It also uses the corrplot and GGally packages. Those two packages will be accessed with the :: notation in the relevant report sections rather than loading directly.
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.1
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts -------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
data_url <- 'https://raw.githubusercontent.com/jyurko/INFSCI_2595_Fall_2020/master/HW/final_project/infsci_2595_final_project_data.csv'
df <- readr::read_csv(data_url, col_names = TRUE)
## Parsed with column specification:
## cols(
## xA = col_character(),
## xB = col_character(),
## x01 = col_double(),
## x02 = col_double(),
## x03 = col_double(),
## x04 = col_double(),
## x05 = col_double(),
## x06 = col_double(),
## x07 = col_double(),
## x08 = col_double(),
## x09 = col_double(),
## x10 = col_double(),
## x11 = col_double(),
## response_1 = col_double(),
## outcome_2 = col_character()
## )
Separate the data set into the variables associated with Step 1 and the variables associated with Step 2. Notice that the “Option A” version of the data excludes the continuous response, response_1, from the step_2_a_df data set.
step_1_df <- df %>% select(xA, xB, x01:x06, response_1)
step_2_b_df <- df %>% select(xA, xB, response_1, x07:x11, outcome_2) %>%
mutate(outcome_2 = factor(outcome_2, levels = c("Fail", "Pass")))
step_2_a_df <- df %>% select(xA, xB, x01:x11, outcome_2) %>%
mutate(outcome_2 = factor(outcome_2, levels = c("Fail", "Pass")))
It is always important to check for missing values before starting any data analysis exercise. The code chunk below uses the df object and counts the number of missing values for each column. The purrr::map_dbl() function is used to iterate over the columns in the tibble. As we can see below, there are no missing entries in this application.
df %>% purrr::map_dbl(~sum(is.na(.)))
## xA xB x01 x02 x03 x04 x05
## 0 0 0 0 0 0 0
## x06 x07 x08 x09 x10 x11 response_1
## 0 0 0 0 0 0 0
## outcome_2
## 0
Step 1 corresponds to a regression problem. So after studying the input variables, we must visualize the relationship between the response and the inputs.
Let’s first consider the counts of the categorical variables. We can display the counts using the count() function from dplyr.
step_1_df %>% count(xA)
## # A tibble: 2 x 2
## xA n
## <chr> <int>
## 1 A1 847
## 2 A2 1166
Or we can visualize the counts with bar charts.
step_1_df %>%
ggplot(mapping = aes(x = xA)) +
geom_bar() +
theme_bw()
Let’s now consider the second categorical variable in this data set.
step_1_df %>%
ggplot(mapping = aes(x = xB)) +
geom_bar() +
theme_bw()
We can visualize the combination of the two categorical variables several different ways. The graphic below uses the bar fill to represent the level of xA.
step_1_df %>%
ggplot(mapping = aes(x = xB)) +
geom_bar(mapping = aes(fill = xA),
position = "dodge") +
ggthemes::scale_fill_colorblind() +
theme_bw()
We can visualize the fraction rather than the counts a few different ways. The code chunk below sets the position argument to "fill" to reveal that each level of xB has a similar fraction of observations associated with xA='A2'.
step_1_df %>%
ggplot(mapping = aes(x = xB)) +
geom_bar(mapping = aes(fill = xA),
position = "fill") +
ggthemes::scale_fill_colorblind() +
theme_bw()
Let’s now consider the continuous inputs. The code chunk below reshapes the step_1_df data set from wide to long-format. This enables using ggplot2 functionality such as facet_wrap() to associate a separate subplot (facet) with each variable.
step_1_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1) %>%
ggplot(mapping = aes(x = value)) +
geom_histogram(bins = 45) +
facet_wrap(~key, scales = "free") +
theme_bw() +
theme(axis.text.y = element_blank())
If you look closely at the reshape operation, the categorical variables, xA and xB, were not “gathered up” into the key column. This was intentional in order to preserve the categorical levels associated with each observation. With this setup we can “split” the data by the categorical variables. Let’s create a separate frequency polygon for each of the six inputs, broken up by the values of xA. The y aesthetic is mapped to the density rather than the counts in the code chunk below, because we saw that there are more observations for 'A2' than 'A1'.
step_1_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1) %>%
ggplot(mapping = aes(x = value)) +
geom_freqpoly(bins = 31,
mapping = aes(y = stat(density),
color = xA),
size = 1.1) +
facet_wrap(~key, scales = "free") +
ggthemes::scale_color_colorblind() +
theme_bw() +
theme(axis.text.y = element_blank())
Alternatively we could consider the CDFs instead of the densities to compare the distributions across groups. It’s quite easy to tell that the distributions are very similar using the CDFs.
step_1_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1) %>%
ggplot(mapping = aes(x = value)) +
stat_ecdf(mapping = aes(color = xA,
linetype = xA),
size = 1.1) +
facet_wrap(~key, scales = "free_x") +
ggthemes::scale_color_colorblind() +
theme_bw()
Let’s now break up the distributions further by considering the xB variable. The graphic below uses the CDF to compare the input distributions for each xA and xB combination. The row facets are controlled by xA, thus each subplot allows comparing the distributions across the different levels of xB.
step_1_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1) %>%
ggplot(mapping = aes(x = value)) +
stat_ecdf(mapping = aes(color = xB,
linetype = xB),
size = 1.) +
facet_grid(xA ~ key, scales = "free_x") +
ggthemes::scale_color_calc() +
theme_bw()
We can use the original data set, step_1_df, to make the CDF comparison for a single input, since the above graphic is rather difficult to read. The code chunk below shows how to do that with the x02 variable.
step_1_df %>%
ggplot(mapping = aes(x = x02)) +
stat_ecdf(mapping = aes(color = xB,
linetype = xB),
size = 1.15) +
facet_wrap(~xA) +
ggthemes::scale_color_calc() +
theme_bw()
It appears that the quantiles of the distribution are shifted slightly based on the xA and xB levels. The overall shape, though, appears similar across the various groups. We can compare the summary statistics more easily with boxplots. The figure below uses the long-format data set again. The boxplots are constructed to allow comparing the summary stats for each input across the xB levels for a single xA level. The boxplots show that indeed the input distributions are “shifted” slightly between the levels of xB.
step_1_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1) %>%
ggplot(mapping = aes(x = xA, y = value)) +
geom_boxplot(mapping = aes(fill = xB, color = xB),
alpha = 0.35) +
facet_wrap(~key, scales = "free_y") +
ggthemes::scale_fill_calc() +
ggthemes::scale_color_calc() +
theme_bw()
We can use the above figure to also compare the summary stats across the levels of xA for a particular xB level. Alternatively we can make that comparison directly by modifying the aesthetics of the ggplot() figure.
step_1_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1) %>%
ggplot(mapping = aes(x = xB, y = value)) +
geom_boxplot(mapping = aes(fill = xA, color = xA),
alpha = 0.35) +
facet_wrap(~key, scales = "free_y") +
ggthemes::scale_fill_colorblind() +
ggthemes::scale_color_colorblind() +
theme_bw()
We can also focus directly on comparing the input means for each categorical group. The code chunk below uses the stat_summary() function with the fun.data argument set to mean_se. This calculates the mean of a group and includes the standard error interval as a linerange geom. The default is to show the ±1 standard error interval, and so fun.args is set to list(mult = 2) to display the ±2 standard error interval. We derived the formula for this standard error earlier in the semester. Do you know what the formula is? Based on the graphic below, can you tell which categorical groups have different mean values?
step_1_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1) %>%
ggplot(mapping = aes(x = xB, y = value)) +
stat_summary(fun.data = "mean_se",
fun.args = list(mult = 2),
mapping = aes(color = xA),
position = position_dodge(0.3)) +
facet_wrap(~key, scales = "free_y") +
ggthemes::scale_fill_colorblind() +
ggthemes::scale_color_colorblind() +
theme_bw()
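As a reminder of what mean_se computes: the standard error of a group mean is the sample standard deviation divided by the square root of the group size. The sketch below computes the ±2 standard error interval manually for x01 grouped by xA; the column names avg, se, lwr, and upr are just illustrative choices, not part of the assignment.

```r
# Manually compute the +/- 2 standard error interval that mean_se displays
step_1_df %>%
  group_by(xA) %>%
  summarise(avg = mean(x01),
            se  = sd(x01) / sqrt(n()),  # standard error of the mean: s / sqrt(n)
            lwr = avg - 2 * se,         # lower bound of the 2 SE interval
            upr = avg + 2 * se)
```

Comparing lwr and upr across groups mimics reading the dodged linerange intervals in the figure above.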
Let’s now consider how the inputs relate to each other. We discussed how highly correlated inputs can impact the model fitting process and interpretability of the results. Thus, it is paramount to investigate if the inputs are highly correlated before training models. The cell below uses the corrplot package to create the correlation plot matrix for all six inputs.
step_1_df %>%
select(x01:x06) %>%
cor() %>%
corrplot::corrplot(type = "upper", method = "number")
The above correlation plot is based on all observations. Let’s now consider the correlation within a particular categorical group. For example, let’s calculate the correlation between the six inputs for the xA='A1' and xB='B1' group.
step_1_df %>%
filter(xA == "A1" & xB == "B1") %>%
select(x01:x06) %>%
cor() %>%
corrplot::corrplot(type = "upper", method = "number")
We could make separate plots for each of the combinations of xA and xB. However, let’s make one graphic which uses separate facets for each combination. The ggcorrplot and corrr packages have been developed over the last few years to help study correlations across grouping variables. I am not that familiar with those packages, and so I create the figure manually. The code chunk below relies on several more advanced data manipulation operations from tidyr which create nested data frames (data.frames can store other data.frames). The result shows the correlation plot between the six inputs for each xA and xB combination. As you can see below, the exact correlation coefficient values change across groups, but the values are rather low in magnitude. Thus, it does not appear that there are substantial correlations between the step 1 inputs.
step_1_df %>%
select(-response_1) %>%
group_by(xA, xB) %>%
tidyr::nest() %>%
mutate(cor_mat = purrr::map(data, cor)) %>%
mutate(cor_df = purrr::map(cor_mat,
function(mm){as.data.frame(mm) %>%
mutate(row_name = rownames(mm))})) %>%
dplyr::select(-data, -cor_mat) %>%
tidyr::unnest(cor_df) %>%
tidyr::gather(key = "input_name", value = "cor_coef", -xA, -xB, -row_name) %>%
ggplot(mapping = aes(x = input_name, y = row_name)) +
geom_tile(mapping = aes(fill = cor_coef)) +
geom_text(mapping = aes(label = round(cor_coef, 2)),
size = 2) +
facet_grid(xA ~ xB) +
scale_fill_gradient2("corr",
low = "red", mid = "white", high = "blue", midpoint = 0,
limits = c(-1, 1)) +
labs(x = "", y = "") +
theme_bw()
Alternatively, we could consider the relationships between the inputs using a pairs plot, which creates a scatter plot between each pair of variables. The GGally library includes the ggpairs() function which creates the pairs plot within the ggplot2 framework.
GGally::ggpairs(step_1_df,
columns = 3:8)
With ggpairs(), we can use aesthetics similar to those available in ggplot2. For example, we can color by xB to group by the levels of xB.
GGally::ggpairs(step_1_df,
columns = 3:8,
mapping = aes(color = xB))
Let’s now consider the continuous response in step 1, response_1. The graphic below displays the marginal histogram, which appears to be symmetric with very wide tails.
step_1_df %>%
ggplot(mapping = aes(x = response_1)) +
geom_histogram(bins = 55) +
theme_bw()
Let’s now break up by the categorical variables, starting with xA. The graphic below shows the distributions appear quite similar.
step_1_df %>%
ggplot(mapping = aes(x = response_1)) +
geom_freqpoly(mapping = aes(color = xA,
y = stat(density)),
bins = 41,
size = 1.15) +
ggthemes::scale_color_colorblind() +
theme_bw()
Next, let’s break up by xB for each level of xA. Based on the figure below, it looks like the response_1 distribution associated with xB='B3' differs from the other groups.
step_1_df %>%
ggplot(mapping = aes(x = response_1)) +
geom_freqpoly(mapping = aes(color = xB,
y = stat(density)),
bins = 31,
size = 1.15) +
facet_wrap(~xA) +
ggthemes::scale_color_calc() +
theme_bw()
Let’s confirm the above finding by examining the response_1 CDF broken up by the categorical groups. As we can see below, the distribution of response_1 is “shifted” to more negative values when xB = 'B3' compared with the other xB levels.
step_1_df %>%
ggplot(mapping = aes(x = response_1)) +
stat_ecdf(mapping = aes(color = xB),
size = 1.1) +
facet_wrap(~xA) +
ggthemes::scale_color_calc() +
theme_bw()
Let’s examine the summary stats associated with response_1 for each categorical group. The conclusions will be the same as those shown by the CDFs, but it is another way of presenting the information. The graphic below shows the summary stats with boxplots grouped by xB, with fill and color groupings associated with xA. In this way, we can see the summary stats are relatively similar between the xA levels for each level of xB. We can also see that the summary stats of response_1 are more negative when xB = 'B3' compared with the summary stats of the other three xB levels.
step_1_df %>%
ggplot(mapping = aes(x = xB, y = response_1)) +
geom_boxplot(mapping = aes(fill = xA, color = xA),
alpha = 0.35) +
ggthemes::scale_fill_colorblind() +
ggthemes::scale_color_colorblind() +
theme_bw()
Let’s now consider how response_1 relates to the continuous inputs. The graphic below uses the long-format version of the data in order to allow separate facets for each input. Thus, each subplot in the figure below shows the scatter plot between response_1 and an input.
step_1_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1) %>%
ggplot(mapping = aes(x = value, y = response_1)) +
geom_point(alpha = 0.5) +
facet_wrap(~key, scales = "free_x") +
theme_bw()
There does not seem to be much of a relationship between response_1 and any of the inputs. However, let’s consider grouping based on the categorical variables. Let’s first consider grouping by xA via the color aesthetic. As we can see below, it is still difficult to observe clear patterns or trends.
step_1_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1) %>%
ggplot(mapping = aes(x = value, y = response_1)) +
geom_point(alpha = 0.5,
mapping = aes(color = xA)) +
facet_wrap(~key, scales = "free_x") +
ggthemes::scale_color_colorblind() +
theme_bw() +
guides(color = guide_legend(override.aes = list(alpha = 1.0)))
Let’s include a geom_smooth() layer to help visualize major trends. The marker transparency has been set to alpha = 0.1 in the code chunk below to make it easier to see the smooth trend. There might be an increasing trend with respect to x01 and a decreasing trend with respect to x02.
step_1_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1) %>%
ggplot(mapping = aes(x = value, y = response_1)) +
geom_point(alpha = 0.1,
mapping = aes(color = xA)) +
geom_smooth(mapping = aes(color = xA)) +
facet_wrap(~key, scales = "free_x") +
ggthemes::scale_color_colorblind() +
theme_bw() +
guides(color = guide_legend(override.aes = list(alpha = 1.0)))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Let’s now consider breaking up by xB. As we can see below, there seems to be a clear distinction between the trends associated with xB = 'B3' and those of the other xB levels. The trends appear to be non-linear. For x01 and x03, the trends for xB = 'B3' have maxima (concave down) while the other xB levels appear to have minima (concave up). For x02, both xB = 'B3' and xB = 'B4' appear to be concave down. Inputs x04 through x06 do not seem to have trends.
step_1_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1) %>%
ggplot(mapping = aes(x = value, y = response_1)) +
geom_point(alpha = 0.1,
mapping = aes(color = xB)) +
geom_smooth(mapping = aes(color = xB)) +
facet_wrap(~key, scales = "free_x") +
ggthemes::scale_color_calc() +
theme_bw() +
guides(color = guide_legend(override.aes = list(alpha = 1.0)))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Let’s now consider the trends associated with each combination of xA and xB. The graphic below uses facet_grid() to show the scatter plot between response_1 and each input. The column facets correspond to the inputs, and the row facets correspond to the levels of xB. Notice that the y-axis scale is different for each row facet since we saw xB = 'B3' consists of more negative values than the other levels. The color denotes the level of xA to check if there are differences in the trends at the same xB level. It appears that there are non-linear trends for some of the inputs (x01 through x03) and those trends depend on xB.
step_1_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1) %>%
ggplot(mapping = aes(x = value, y = response_1)) +
geom_point(alpha = 0.1,
mapping = aes(color = xA)) +
geom_smooth(mapping = aes(color = xA)) +
facet_grid(xB~key, scales = "free") +
ggthemes::scale_color_colorblind() +
theme_bw() +
guides(color = guide_legend(override.aes = list(alpha = 1.0)))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
It is important to note that the above graphic visualizes the trends with respect to a single continuous input. It does not consider two or more continuous inputs together. An attempt to consider the interaction between continuous inputs is shown below for x01 and x03. The facets correspond to categorical groups and the color denotes if x03 is greater than its median. At least in this simple visualization there does not appear to be an interaction, but we will need dedicated predictive models to truly confirm this or not.
step_1_df %>%
ggplot(mapping = aes(x = x01, y = response_1)) +
geom_point(mapping = aes(color = x03 > median(x03)), alpha=0.25) +
geom_smooth(mapping = aes(color = x03 > median(x03))) +
facet_grid(xA ~ xB) +
scale_color_brewer(palette = "Set1") +
theme_bw() +
theme(legend.position = "top")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
It appears that the step 1 inputs are not correlated with each other and that the correlation is not substantially impacted by the categorical variable groupings. It also seems that the continuous response, response_1, has non-linear trends with respect to the continuous inputs. Those trends seem to depend on the level of xB, with visually clear differences for the xB = 'B3' level. This suggests that we should consider non-linear features and interactions with the categorical variables when we train predictive models.
We will now consider the variables associated with the second step in the process. The categorical variables are the same and so we will not repeat those visualizations. The response_1 continuous response is an input to the step 2 process, and we already looked at its marginal distribution. Thus we will start out with the input variables associated with step 2, x07 through x11. Then we will consider their correlations, including with response_1. And finally we will examine groupings associated with the binary outcome of interest.
The histograms for the step 2 inputs are shown below, with each facet showing a separate input. The step_2_b_df object is reshaped to long-format to support creating the figure.
step_2_b_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1, -outcome_2) %>%
ggplot(mapping = aes(x = value)) +
geom_histogram(bins = 45) +
facet_wrap(~key, scales = "free") +
theme_bw() +
theme(axis.text.y = element_blank())
Let’s now break up the step 2 input histograms based on xA. As shown in the figure below, it looks like the distributions are quite similar across the xA levels.
step_2_b_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1, -outcome_2) %>%
ggplot(mapping = aes(x = value)) +
geom_freqpoly(mapping = aes(color = xA,
y = stat(density)),
bins = 41,
size = 1.1) +
facet_wrap(~key, scales = "free") +
ggthemes::scale_color_colorblind() +
theme_bw() +
theme(axis.text.y = element_blank())
Let’s now break up by xB using histograms.
step_2_b_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1, -outcome_2) %>%
ggplot(mapping = aes(x = value)) +
geom_freqpoly(mapping = aes(color = xB,
y = stat(density)),
bins = 31,
size = 1.1) +
facet_wrap(~key, scales = "free") +
ggthemes::scale_color_calc() +
theme_bw() +
theme(axis.text.y = element_blank())
Since the above figure is rather busy, let’s use the CDF comparison instead.
step_2_b_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1, -outcome_2) %>%
ggplot(mapping = aes(x = value)) +
stat_ecdf(mapping = aes(color = xB),
size = 1.) +
facet_wrap(~key, scales = "free_x") +
ggthemes::scale_color_calc() +
theme_bw()
It appears that the distributions are shifted slightly based on the xB level. Let’s now focus on the summary stats using boxplots. As we can see below, the input summary stats (medians, 25th, and 75th quantiles) are shifted slightly based on the xB level.
step_2_b_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1, -outcome_2) %>%
ggplot(mapping = aes(x = xB, y = value)) +
geom_boxplot(mapping = aes(fill = xB, color = xB),
alpha = 0.35) +
facet_wrap(~key, scales = "free_y") +
ggthemes::scale_color_calc(guide = FALSE) +
ggthemes::scale_fill_calc(guide = FALSE) +
theme_bw()
Let’s now consider grouping by xB and xA. The fill and color of the boxplots in the graphic below represent the xA level while the x aesthetic corresponds to the xB level. As before, the facets denote the input. The summary stats appear quite similar across the xA levels for a given xB level, and the variation across xB levels is consistent with the previous figures.
step_2_b_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -response_1, -outcome_2) %>%
ggplot(mapping = aes(x = xB, y = value)) +
geom_boxplot(mapping = aes(fill = xA, color = xA),
alpha = 0.35) +
facet_wrap(~key, scales = "free_y") +
ggthemes::scale_color_colorblind() +
ggthemes::scale_fill_colorblind() +
theme_bw()
Let’s now study the correlations between the input variables. We will include response_1 in the following figure, but it will be renamed r01 to shorten its name. The graphic below uses corrplot() to study the correlation between the five step 2 inputs and response_1.
step_2_b_df %>%
select(x07:x11, response_1) %>%
rename(r01 = response_1) %>%
cor() %>%
corrplot::corrplot(type = "upper", method = "number")
We can also consider the correlation associated with each categorical group. The graphic below is created using the same approach from before, which relies on nested data frames. The correlation is low regardless of the grouping, similar to the structure of the step 1 inputs.
step_2_b_df %>%
select(-outcome_2) %>%
rename(r01 = response_1) %>%
group_by(xA, xB) %>%
tidyr::nest() %>%
mutate(cor_mat = purrr::map(data, cor)) %>%
mutate(cor_df = purrr::map(cor_mat,
function(mm){as.data.frame(mm) %>%
mutate(row_name = rownames(mm))})) %>%
dplyr::select(-data, -cor_mat) %>%
tidyr::unnest(cor_df) %>%
tidyr::gather(key = "input_name", value = "cor_coef", -xA, -xB, -row_name) %>%
ggplot(mapping = aes(x = input_name, y = row_name)) +
geom_tile(mapping = aes(fill = cor_coef)) +
geom_text(mapping = aes(label = round(cor_coef, 2)),
size = 2) +
facet_grid(xA ~ xB) +
scale_fill_gradient2("corr",
low = "red", mid = "white", high = "blue", midpoint = 0,
limits = c(-1, 1)) +
labs(x = "", y = "") +
theme_bw()
Let’s now consider the binary outcome, outcome_2. Let’s first look at the number of observations for each level. As we can see below, the binary outcome is roughly balanced.
step_2_b_df %>%
ggplot(mapping = aes(x = outcome_2)) +
geom_bar() +
theme_bw()
Next, let’s examine the behavior of the binary outcome associated with the categorical variables. We will start by grouping by xA. As we can see below, the classes (the levels of outcome_2) are roughly balanced for each level of xA.
step_2_b_df %>%
ggplot(mapping = aes(x = xA)) +
geom_bar(mapping = aes(fill = outcome_2),
position = 'dodge') +
scale_fill_brewer(palette = "Set1") +
theme_bw()
And now let’s consider the count of the binary outcome associated with the combinations of xA and xB. The levels of xB are denoted by the separate facets, with the facets horizontally positioned so we can directly compare the heights of the bars. It looks like xB = 'B2' has more observations of the 'Pass' class, while xB = 'B4' is more associated with the 'Fail' class. The other levels of xB are more balanced.
step_2_b_df %>%
ggplot(mapping = aes(x = xA)) +
geom_bar(mapping = aes(fill = outcome_2), position = "dodge") +
facet_grid(~xB) +
scale_fill_brewer(palette = "Set1") +
theme_bw()
The figure above displays counts rather than fractions. I like to examine the counts first to get an idea of the sample size associated with the groupings. Let’s now look at the fraction of observations associated with each level of outcome_2 for each combination of xA and xB. The figure below confirms what we saw previously. The 'Fail' class occurs more frequently when xB = 'B4', while the 'Pass' class occurs more frequently when xB = 'B2'.
step_2_b_df %>%
ggplot(mapping = aes(x = xA)) +
geom_bar(mapping = aes(fill = outcome_2), position = "fill") +
geom_hline(yintercept = 0.5, color = 'white', linetype = 'dashed',
size = 1.) +
facet_grid(~xB) +
scale_fill_brewer(palette = "Set1") +
theme_bw()
Let’s now visualize the distributions of the continuous variables associated with step 2, grouped by the outcome_2 levels. Doing this allows us to see if intervals of the continuous variables are associated with higher frequencies of one of the outcome_2 levels. We have visualized distributions several different ways in this report. For this analysis we will use boxplots so that we can focus on the summary stats. The figure below calculates the summary stats for each step 2 continuous input, including response_1, for each of the levels of outcome_2.
step_2_b_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -outcome_2) %>%
ggplot(mapping = aes(x = outcome_2, y = value)) +
geom_boxplot() +
coord_flip() +
facet_wrap(~key, scales = "free_x") +
theme_bw()
It appears that response_1 has lower variability when outcome_2 = 'Pass' compared to when outcome_2 = 'Fail'. Let’s now “drill down” further by also grouping by xB. The graphic below uses the color and fill of the boxplot to represent outcome_2. In this way, we group by both xB and outcome_2. We are able to directly compare the step 2 input summary statistics associated with failed observations at a particular xB level to those associated with passed observations. Interestingly, it appears that 'B1', 'B3', and 'B4' have differences in the response_1 value between the outcome_2 levels. The median response_1 for failed observations is greater than the 75th quantile for passed observations when xB = 'B1' and xB = 'B4', while the median for failed observations is less than the 25th quantile for passed observations when xB = 'B3'. This suggests that response_1 impacts the probability of outcome_2 = 'Fail'. The “direction” (more positive or more negative values) of that influence could depend on xB.
step_2_b_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -outcome_2) %>%
ggplot(mapping = aes(x = xB, y = value)) +
geom_boxplot(mapping = aes(fill = outcome_2,
color = outcome_2),
alpha = 0.35) +
coord_flip() +
facet_wrap(~key, scales = "free_x") +
scale_fill_brewer(palette = "Set1") + scale_color_brewer(palette = "Set1") +
theme_bw()
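The boxplot comparisons above can also be checked numerically. The sketch below computes the 25th percentile, median, and 75th percentile of response_1 for each xB and outcome_2 combination; the column names q25, med, and q75 are just illustrative choices.

```r
# Summary stats of response_1 for each xB and outcome_2 group,
# mirroring what the grouped boxplots display
step_2_b_df %>%
  group_by(xB, outcome_2) %>%
  summarise(q25 = quantile(response_1, 0.25),  # 25th percentile (lower hinge)
            med = median(response_1),
            q75 = quantile(response_1, 0.75),  # 75th percentile (upper hinge)
            .groups = "drop")
```

Comparing the med column for 'Fail' against the q75 or q25 column for 'Pass' within a given xB level reproduces the comparisons described above.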
Let’s now focus just on the mean value of each variable associated with the outcome_2 and xB groupings. Note that the horizontal scales are different in the figure below compared with the previous figure. The means for response_1 appear to be different within each level of xB based on the level of outcome_2. This suggests that failed observations are associated with different average response_1 values compared to passed observations, regardless of the xB level. Step 2 inputs x07 through x09 have at least one level of xB with “separation” between the means. This suggests that those inputs also impact the failure probability, but that influence depends on xB. It seems that step 2 inputs x10 and x11 do not have much of an effect on the failure probability, at least on average. We also have not considered any interactions up to this point, so we cannot say for certain yet whether x10 and x11 are insignificant. This is just exploratory (we are graphically performing the t-test), but we are starting to get hints about the possible important features.
step_2_b_df %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -outcome_2) %>%
ggplot(mapping = aes(x = xB, y = value)) +
stat_summary(fun.data = "mean_se",
fun.args = list(mult = 2),
mapping = aes(color = outcome_2),
position = position_dodge(0.2)) +
coord_flip() +
facet_wrap(~key, scales = "free_x") +
scale_color_brewer(palette = "Set1") +
theme_bw()
Let’s consider the binary outcome relative to the values of the step 2 inputs. In class, we visualized such figures with the binary outcome encoded as 1 for the event and 0 for the non-event. Let’s go ahead and use the same encoding for this visualization, where the event of interest is outcome_2 = 'Fail'. The code chunk below creates the y variable using an ifelse() statement before reshaping the step_2_b_df object into long-format. As shown below, the encoded binary outcome is either 1 or 0. The geom_jitter() geom is used to add slight vertical jitter to the points so they do not all overlap at 0 and 1.
step_2_b_df %>%
mutate(y = ifelse(outcome_2 == "Fail", 1, 0)) %>%
tibble::rowid_to_column() %>%
tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -outcome_2, -y) %>%
ggplot(mapping = aes(x = value, y = y)) +
geom_jitter(alpha = 0.2, height = 0.02) +
facet_wrap(~key, scales = "free_x") +
theme_bw()
It might be tempting to include a geom_smooth() layer to smooth out the trends, but we must be careful. We cannot just use the default arguments to geom_smooth(), because it treats the problem as a regression problem. We are dealing with a logistic regression application. Thus, we need to instruct geom_smooth() to use glm instead of the default smoothing options. The code chunk below shows how to do that by specifying method = 'glm' and including the family argument in the method.args argument to geom_smooth().
step_2_b_df %>%
  mutate(y = ifelse(outcome_2 == "Fail", 1, 0)) %>%
  tibble::rowid_to_column() %>%
  tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -outcome_2, -y) %>%
  ggplot(mapping = aes(x = value, y = y)) +
  geom_jitter(alpha = 0.2, height = 0.02) +
  geom_smooth(formula = y ~ x,
              method = 'glm',
              method.args = list(family = 'binomial')) +
  facet_wrap(~key, scales = "free_x") +
  theme_bw()
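Under the hood, each smoother panel is fitting the same model we could fit manually with glm(). The sketch below shows that equivalence on simulated data (the input name x and the true slope of 1.5 are arbitrary choices for illustration): the recovered coefficient should be close to the value used to generate the data.

```r
set.seed(1)

# simulate a binary outcome whose event probability follows a logistic curve
x <- seq(-3, 3, length.out = 200)
y <- rbinom(n = 200, size = 1, prob = plogis(1.5 * x))

# the model geom_smooth(method = 'glm', family = 'binomial') fits per panel
fit <- glm(y ~ x, family = binomial)

coef(fit)
```

Fitting the glm() directly, rather than relying only on geom_smooth(), also gives access to coefficient standard errors and deviance via summary(fit).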
Let’s now break up the logistic regression trends based on xB. The color aesthetic is mapped to xB in the geom_smooth() layer in the code chunk below. We can see the clear difference in the behavior of the encoded binary outcome with respect to response_1. We knew already from earlier visualizations that xB = 'B3' had more negative values than the other levels. The figure below shows the trends in the event probability are also different when xB = 'B3' with respect to response_1. We can also see that xB = 'B2' has an increasing event probability with respect to x08, while the trends for the other levels of xB are decreasing or appear to have no trend at all. This suggests that the behavior of the event probability with respect to x08 depends on xB.
step_2_b_df %>%
  mutate(y = ifelse(outcome_2 == "Fail", 1, 0)) %>%
  tibble::rowid_to_column() %>%
  tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -outcome_2, -y) %>%
  ggplot(mapping = aes(x = value, y = y)) +
  geom_jitter(alpha = 0.2, height = 0.02) +
  geom_smooth(formula = y ~ x,
              method = 'glm',
              method.args = list(family = 'binomial'),
              mapping = aes(color = xB)) +
  facet_wrap(~key, scales = "free_x") +
  ggthemes::scale_color_calc() +
  theme_bw()
The above visualizations set the formula argument to geom_smooth(). We did not have to do that, since the formula is just the default y ~ x. However, it was included to demonstrate that our logistic regression models assume a linear relationship between the input and the linear predictor. We can use other relationships in our generalized linear model. The code chunk below uses a four degree-of-freedom natural spline to check for non-linear behavior. The figure below is busy, but it seems that there may be non-linear behavior in the linear predictor of the event probability with respect to x07 and x08. It is difficult to tell, and again this is just exploratory. We will use predictive models to confirm whether such behavior is truly present.
step_2_b_df %>%
  mutate(y = ifelse(outcome_2 == "Fail", 1, 0)) %>%
  tibble::rowid_to_column() %>%
  tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -outcome_2, -y) %>%
  ggplot(mapping = aes(x = value, y = y)) +
  geom_jitter(alpha = 0.2, height = 0.02) +
  geom_smooth(formula = y ~ splines::ns(x, 4),
              method = 'glm',
              method.args = list(family = 'binomial'),
              mapping = aes(color = xB)) +
  facet_wrap(~key, scales = "free_x") +
  ggthemes::scale_color_calc() +
  theme_bw()
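To make the spline formula less mysterious: splines::ns(x, 4) expands the single input into four basis functions, and the glm then estimates one coefficient per basis column rather than one slope. A minimal self-contained illustration of the basis expansion (the input grid below is arbitrary):

```r
library(splines)

# an arbitrary input grid for illustration
x <- seq(0, 1, length.out = 10)

# natural spline basis with 4 degrees of freedom:
# one column per basis function, one row per observation
B <- ns(x, df = 4)

dim(B)
```

This is why the spline smoothers can bend: the linear predictor is still linear in the coefficients, but it is a linear combination of four curved basis functions of the input.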
The previous figures grouped behavior by xB. We could also consider xA and the combination of xA and xB. The combination of the two makes the figures more complicated. They are best viewed at larger figure sizes; this report uses the default figure size options in RMarkdown, so it will be difficult to examine them in detail. The code chunk below demonstrates how to make such a figure using facet_grid() to create vertical facets associated with the two levels of xA. Running the code chunk below produces the linear separation warning associated with logistic regression (the warning is displayed to the screen). This is an interesting insight into the problem that we should consider when moving forward with predictive models. Based on the previous set of visualizations, which of the input variables do you think is associated with this warning message?
step_2_b_df %>%
  mutate(y = ifelse(outcome_2 == "Fail", 1, 0)) %>%
  tibble::rowid_to_column() %>%
  tidyr::gather(key = "key", value = "value", -rowid, -xA, -xB, -outcome_2, -y) %>%
  ggplot(mapping = aes(x = value, y = y)) +
  geom_jitter(alpha = 0.2, height = 0.02) +
  geom_smooth(formula = y ~ splines::ns(x, 4),
              method = 'glm',
              method.args = list(family = 'binomial'),
              mapping = aes(color = xB)) +
  facet_grid(xA ~ key, scales = "free_x") +
  ggthemes::scale_color_calc() +
  theme_bw() +
  theme(legend.position = "top")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
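That warning is easy to reproduce on purpose, which helps build intuition for what it means. In the self-contained sketch below (toy values chosen for illustration), the input perfectly separates the two classes: every negative x is a non-event and every positive x is an event. Maximum likelihood then pushes the fitted probabilities to numerical 0 and 1, and glm() emits the same "fitted probabilities numerically 0 or 1 occurred" warning seen above.

```r
# toy, perfectly separated data: the sign of x fully determines y
x <- c(-2, -1, 1, 2)
y <- c(0, 0, 1, 1)

# triggers the same warning as the faceted smoothers above;
# suppressed here only so the fitted values can be inspected
fit <- suppressWarnings(glm(y ~ x, family = binomial))

# fitted probabilities are driven to (numerically) 0 and 1
range(fitted(fit))
```

In practice, separation within a facet means some combination of input value and grouping variable has only failures (or only passes), which is itself useful exploratory information even though it destabilizes the coefficient estimates.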
Lastly, let’s examine the influence of the interactions between the inputs. We’ve seen that response_1 impacts the event probability, and that the trend depends on the level of xB. Since the change in slope appears to be quite strong with respect to xB, we will instead focus on other inputs while considering the grouping of the categorical variables. It looks like the event probability depends on x07 and x08 so we will focus on those two variables. The graphic below creates scatter plots between x07 and x08 with the markers colored by outcome_2.
step_2_b_df %>%
  ggplot(mapping = aes(x = x07, y = x08)) +
  geom_point(mapping = aes(color = outcome_2),
             alpha = 0.25) +
  facet_grid(xA ~ xB) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  theme(legend.position = "top") +
  guides(color = guide_legend(override.aes = list(alpha = 1.0)))
The scatter plot above reveals that the failed observations appear to be clustered. Qualitatively, these clusters appear consistent with the non-linear trends we saw in the smoothing plots. Again, we will need to use models to confirm such behavior.
We can try to “smooth” out the scatter plots using a 2D density estimate. The code chunk below creates separate density contour plots for each outcome_2 class. The red contours show the density contours associated with failed observations, and the blue contours show the density contours of the passed (non-failed) observations. Notice that for xB = 'B2' and xA = 'A1', there appear to be two modes for the failed class. The “valley” between the modes consists of the passed observations. Input combinations in that region between the failed modes would therefore minimize the failure probability.
step_2_b_df %>%
  ggplot(mapping = aes(x = x07, y = x08)) +
  geom_density_2d(mapping = aes(color = outcome_2),
                  size = 1.) +
  facet_grid(xA ~ xB) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() +
  theme(legend.position = "top") +
  guides(color = guide_legend(override.aes = list(alpha = 1.0)))
We have seen that the step 2 inputs are not highly correlated with each other. We have also visualized that the failure probability depends on the categorical variables, and that those variables appear to impact the trends of the failure probability with respect to the continuous step 2 inputs. We should therefore anticipate needing to at least consider predictive models that allow for interactions between the categorical and continuous inputs. We have also seen that the linear predictor of the failure probability may be non-linearly related to the step 2 inputs. We should not be surprised if models that allow for non-linear trends out-perform models that only consider interactions.